Intro

Here we compute the transcription factor and pathway activity of the 5-FU treated human in-vitro data. This will give us an overview of the relevant pathways and TFs activated and responding to the drug, depending on time and concentration.

Issue with transcriptomics data

There is a problem in DE analysis: some genes with no expression gets very high log2FC. Check the human_SI_transcriptomics_issue script to see details.

We found that if we filter out the genes that have zero expression across more than one replica the issue is resolves.

Import and preprocess the transcriptomics data

First we import all the transcriptomics data of SI treated with 5-FU and Colon treated with 5-FU. This data was already preprocessed, i.e. we removed the genes where the log2 FC was artificially high, while the expression was zero.

transcriptomics_data <- import_human_transcriptomics_preprocessed_data()
sample_table <-  transcriptomics_data %>% select(concentration,sample_id,organ,time) %>% unique()

Calculate TF activity:

We use the Dorothea regulon, which contains the HGNC symbols of transcription factors and transcription factor targets. Therefore we need to map the Ensembl gene ids to HGNC. We use BiomaRt for this.

Gene name conversion

  • out of R length(tr_genes) ensembl genes, we found R tr_genes_found
  • didn’t find R tr_genes_notfound

List of missing, but significantly changing genes:

# We are missing the following ENSEMBL ids: 
missing_genes = tr_genes[!tr_genes %in% genes_dic$ensembl_gene_id]
# around 10-15 genes are significantly different between conditions:
transcriptomics_data %>% 
    filter(ensembl_gene_id %in% missing_genes) %>% 
    filter(padj<0.05) %>%
    arrange(padj)
## # A tibble: 870 x 12
##    fileID concentration  time ensembl_gene_id baseMean log2FoldChange lfcSE
##    <chr>          <dbl> <dbl> <chr>              <dbl>          <dbl> <dbl>
##  1 1000u…          1000    72 ENSG00000284526     406.          -3.85 0.190
##  2 1000u…          1000    72 ENSG00000272405     304.          -3.28 0.166
##  3 1000u…          1000    48 ENSG00000284526     406.          -3.39 0.183
##  4 1000u…          1000    72 ENSG00000238164     206.          -4.79 0.264
##  5 1000u…          1000    48 ENSG00000238164     206.          -4.21 0.234
##  6 1000u…          1000    72 ENSG00000225138     186.          -4.55 0.256
##  7 1000u…          1000    72 ENSG00000249673     185.          -3.58 0.209
##  8 1000u…          1000    48 ENSG00000272405     304.          -2.45 0.147
##  9 1000u…          1000    48 ENSG00000249673     185.          -3.22 0.197
## 10 1000u…          1000    48 ENSG00000225138     186.          -3.33 0.204
## # … with 860 more rows, and 5 more variables: stat <dbl>, pvalue <dbl>,
## #   padj <dbl>, organ <chr>, sample_id <chr>

R n_missing_HGCN genes found by EnsemblID has no HGCN name.

add the gene names to the transcription data

transcriptomics_data <- transcriptomics_data %>%
    left_join(genes_dic, by = "ensembl_gene_id")

Some Ensembl id were not found, we removed those genes and also those that have no gene names. Averaged those transcripts that have the same gene name.

Check how many dorothea regulons are in the data

regulons %>% 
    filter(target %in% dorothea_data$hgnc_symbol) %>% 
    nrow() %>%
    print()
## [1] 4875
nrow(regulons)
## [1] 6620

4873 of the 6620 TF target transcripts are found in the data.

TF estimation

Run VIPER to estimate TF activity in samples. We run the analysis on log2 FC, therefore we get TFs that are strongly up/down regulated in treatment vs control.

# data frame with the original data
viper_input = dorothea_data_hgnc  %>%
    spread(sample_id,log2FC) %>%
    column_to_rownames("hgnc_symbol")

rerun = FALSE
if(rerun){
    tf_activities <- dorothea::run_gviper(viper_input, regulons,
                                          options =  list(minsize = 5, eset.filter = FALSE, 
                                                        cores = 1, verbose = TRUE, nes = TRUE),tidy = TRUE)
    
    # attach the description of the samples
    tf_activities <- left_join(tf_activities,
                               rename(sample_table,sample = sample_id),
                               by = "sample")
    
    if(FALSE) write_rds(tf_activities,"./data/results/tf_activity_gviper_all_human.rds")
}else{
    tf_activities = read_rds("./data/results/tf_activity_gviper_all_human.rds")
}

Example:

net_e2f2 <- regulons %>% filter(tf == "E2F2") %>%
    dplyr::rename(from = "tf", to = "target")

net_e2f2_nodes <- tibble(id = unique(c(net_e2f2$from,net_e2f2$to))) %>%
    mutate(label = id) %>%
    #mutate(color = ifelse(id %in% net_e2f2$from, "red","green")) %>%
    mutate(hgnc_symbol = id) %>%
    left_join(filter(dorothea_data_hgnc,sample_id =="colon_1000uM_24h"), by= "hgnc_symbol") %>%
    mutate(size = 5 + 5*(abs(log2FC) - min(abs(log2FC),na.rm = TRUE))) %>%
    mutate(size = ifelse(is.na(size),5,size)) %>%
    mutate(color = ifelse(is.na(log2FC), "grey", ifelse(sign(log2FC) == 1,"blue","red") ))


visNetwork::visNetwork(nodes = net_e2f2_nodes,  edges = net_e2f2)%>% 
    visNetwork::visEdges(arrows = 'to', scaling = list(min = 2, max = 2),color = "mor" )

We take the 5 top TF’s from each conditions and check which are the genes that were used to determine their activities:

top_5_tf = tf_activities %>%
    group_by(sample) %>%
    top_n(5, wt = abs(activity)) %>%
    arrange(activity) %>% pull(tf) %>% unique()

regulons %>% 
    filter(tf %in% top_5_tf) %>%
    inner_join(dorothea_data %>% rename(target = hgnc_symbol),by = "target") %>%
    left_join(sample_table, by = c("concentration", "time", "organ", "sample_id")) %>%
    group_by(sample_id,tf) %>%
    mutate(lfc_rank = n()-rank(abs(log2FoldChange))) %>% 
    mutate(label = ifelse(padj<0.05 & lfc_rank<5, target, "" ) ) %>%
    arrange(desc(abs(log2FoldChange))) %>%
    
    ggplot(aes(log2FoldChange,-log10(pvalue))) +
    geom_point(aes(color=padj<0.05)) +
    ggrepel::geom_text_repel(aes(label=label)) +
    facet_grid(organ+concentration+time~tf)